Colombia COVID-19 - Central region
Our project
We decided to do central Colombia, basically because it is where the capital is.
We built a model for the number of confirmed cases using all the others covariates (plus some we created) and we estimated the predictive accuracy of our selected model.
We decided to consider as central Colombia the following departments/districts: Bogotà DC, Boyacá, Tolima, Cundinamarca, Meta, Quindío, Valle del Cauca, Risaralda, Celdas, Boyacá, Antioquia, Santander, Casanare.
Loading the dataset
colombia_covid <- as.data.frame(read_csv("data/datasets_567855_1056808_Casos1.csv"))
colombia_covid_colnames <- c("ID de caso", "Fecha de diagnóstico", "Ciudad de ubicación", "Departamento o Distrito", "Atención", "Edad" , "Sexo", "Tipo", "País de procedencia")
colnames(colombia_covid)[5] <- "Atención"
colnames(colombia_covid)[8] <- "Tipo"
# slicing the main dataset
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá", "Quindío", "Cauca",
"Valle del Cauca", "Risaralda", "Caldas", "Boyacá", "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in% central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]Expanding the dataset
Description of variables
ID de caso: ID of the confirmed case.
Fecha de diagnóstico: Date in which the disease was diagnosed.
Ciudad de ubicación: City where the case was diagnosed.
Departamento o Distrito: Department or district where the city belongs to.
Atención: Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.
Edad: Age of the confirmed case.
Sexo: Sex of the confirmed case.
Tipo: How the person got infected: in Colombia, abroad or unknown.
País de procedencia: Country of origin if the person got infected abroad.
Map
Here we can see our selected cities. The color of the pins is related with the number of cases: if they are less than \(10\) the color is “green”, if they are less than \(100\) the color is “orange”, otherwise it is “red”.
Preprocessing
We had to clean the dataset:
We transformed the
Fecha de diagnósticovariable into aDatetype variable,we fixed the variable
Id de caso(since we removed some departments, so some lines, the numbers weren’t consecutive),we created a variable
Grupo de edad,we cleaned the column
País de procedencia(replaced cities with the country) and created the variableContinente de procedencia(as the first is too fragmented we thought to consider the continents).
## ID de caso Fecha de diagnóstico Ciudad de ubicación Departamento o Distrito
## 1 1 2020-03-06 Bogotá Bogotá D.C.
## 2 2 2020-03-09 Buga Valle del Cauca
## 3 3 2020-03-09 Medellín Antioquia
## Atención Edad Sexo Tipo País de procedencia Grupo de edad
## 1 Recuperado 19 F Importado Italia 19_30
## 2 Recuperado 34 M Importado España 31_45
## 3 Recuperado 50 F Importado España 46_60
## Continente de procedencia
## 1 Europa
## 2 Europa
## 3 Europa
New dataset I
## Date Elapsed time New cases/day Cumulative cases
## 1 2020-03-06 0 1 1
## 2 2020-03-09 3 2 3
## 3 2020-03-11 5 5 8
## 4 2020-03-12 6 2 10
## 5 2020-03-13 7 3 13
## 6 2020-03-14 8 8 21
## 7 2020-03-15 9 13 34
## 8 2020-03-16 10 12 46
## 9 2020-03-17 11 12 58
## 10 2020-03-18 12 24 82
New dataset II
## Date Elapsed time Department Department ID New cases/day
## 1 2020-03-09 3 Antioquia 1 1
## 2 2020-03-11 5 Antioquia 1 3
## 3 2020-03-14 8 Antioquia 1 3
## 4 2020-03-15 9 Antioquia 1 1
## 5 2020-03-19 13 Antioquia 1 3
## 6 2020-03-20 14 Antioquia 1 11
## 7 2020-03-21 15 Antioquia 1 3
## 8 2020-03-22 16 Antioquia 1 5
## 9 2020-03-23 17 Antioquia 1 22
## 10 2020-03-25 19 Antioquia 1 8
## Cumulative cases/Department Mean age
## 1 1 50.00000
## 2 4 35.66667
## 3 7 30.00000
## 4 8 55.00000
## 5 11 52.33333
## 6 22 39.81818
## 7 25 31.00000
## 8 30 45.40000
## 9 52 36.45455
## 10 60 29.75000
## Date Elapsed time Department Department ID New cases/day
## 1 2020-03-09 3 Antioquia 1 1
## 2 2020-03-11 5 Antioquia 1 3
## 16 2020-04-02 27 Antioquia 1 20
## 17 2020-03-06 0 Bogotá D.C. 2 1
## 18 2020-03-11 5 Bogotá D.C. 2 2
## 40 2020-04-02 27 Bogotá D.C. 2 70
## 41 2020-03-15 9 Cundinamarca 7 1
## 42 2020-03-16 10 Cundinamarca 7 1
## 54 2020-04-01 26 Cundinamarca 7 4
## 55 2020-03-15 9 Risaralda 10 1
## Cumulative cases/Department Mean age
## 1 1 50.00000
## 2 4 35.66667
## 16 127 44.80000
## 17 1 19.00000
## 18 3 25.00000
## 40 542 43.18571
## 41 1 29.00000
## 42 2 24.00000
## 54 42 30.75000
## 55 1 22.00000
Exploring the dataset
Other plots
Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))
ggplot(data=colombia_covid, aes(x=`Departamento o Distrito`, fill = Sexo)) +
geom_bar(data = subset(colombia_covid, Sexo == "F")) +
geom_bar(data = subset(colombia_covid, Sexo == "M"), aes(y=..count..*(-1))) +
scale_y_continuous(breaks = brks,
labels = lbls) +
coord_flip() +
labs(title="Spread of the desease across genders",
y = "Number of cases",
x = "Department",
fill = "Gender") +
theme_tufte() +
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank()) +
scale_fill_brewer(palette = "Dark3") #compute percentage so that we can label more precisely the pie chart
age_groups_pie <- colombia_covid %>%
group_by(`Grupo de edad`) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(`Grupo de edad`))
age_groups_pie$label <- scales::percent(age_groups_pie$per)
age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(`Grupo de edad`))) +
geom_bar(stat="identity", width = 1) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill="Age groups",
x=NULL,
y=NULL,
title="Distribution of the desease across ages") +
coord_polar(theta = "y") +
#geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
geom_label_repel(aes(x=1, y=cumsum(per) - per/2, label=label), size=3, show.legend = F, nudge_x = 0) +
guides(fill = guide_legend(title = "Group"))
age_pie Age-Sex plot
Tipo plot
theme_set(theme_classic())
ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_brewer(palette = "Set3") +
geom_histogram(aes(fill=Tipo), width = 0.8, stat="count") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across type",
x = "Date of confirmation",
fill = "Type")Tipo
I think that en estudio means that it is not clear while the case is imported or not, however it seems like there are more imported cases, we can count them:
type_pie <- colombia_covid %>%
group_by(Tipo) %>%
count() %>%
ungroup() %>%
mutate(per=`n`/sum(`n`)) %>%
arrange(desc(Tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie<-type_pie[names(type_pie)!="per"]
colnames(type_pie)<-c("Tipo", "Total number", "Percentage")
type_pie## # A tibble: 3 x 3
## Tipo `Total number` Percentage
## <chr> <int> <chr>
## 1 Relacionado 291 29.3%
## 2 Importado 467 47.0%
## 3 En estudio 235 23.7%
Continent
Now let’s plot a pie chart to be able to see the distribution of cases across the continents.
The majority of the cases in the country are people that got infected inside Colombia. Then, people that contracted the disease abroad came mainly from Europe, followed by North America and Central America.
The frequentist approach
Train/test split
We splitted the data so to leave out the last three points for prediction, because we have few points and because in this models it has no sense to predict a week, because the situation changes really fast.
Poisson with Elapsed time as predictor
poisson1 <- glm(`Cumulative cases` ~ `Elapsed time`, data=cases[1:22, ], family=poisson)
pred.pois <- poisson1$fitted.values
res.st <- (cases$`Cumulative cases`[1:22] - pred.pois)/sqrt(pred.pois)
#n=25
#k=2
#n-k=23
print(paste("Estimated overdispersion", sum(res.st^2)/23))## [1] "Estimated overdispersion 7.87621495633838"
poisson1.pred <- predict(poisson1, newdata = cases[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.pred - cases$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 320.101909947755"
#paste("MSE:", mean(poisson1$residuals^2))
#print(sprintf("MSE: %0.2f", sum(poisson1$residuals^2)/poisson1$df.residual))
#print(sprintf("MSE: %0.2f", anova(poisson1)['Residuals', 'Mean Sq']))
paste("AIC:", poisson1$aic)## [1] "AIC: 334.851308168759"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1$null.deviance, deviance(poisson1)), 2))## [1] "Null deviance: 4909.62" "Residual deviance: 194.49"
Using cases_relev_dep
poisson1A <- glm(`Cumulative cases/Department` ~ `Elapsed time`, data=cases_relev_dep, family=poisson)
summary(poisson1A)##
## Call:
## glm(formula = `Cumulative cases/Department` ~ `Elapsed time`,
## family = poisson, data = cases_relev_dep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.9994 -4.7902 -2.1236 0.8586 16.3553
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.982659 0.062709 15.67 <2e-16 ***
## `Elapsed time` 0.167306 0.002779 60.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 8732.6 on 82 degrees of freedom
## Residual deviance: 3855.1 on 81 degrees of freedom
## AIC: 4285.4
##
## Number of Fisher Scoring iterations: 5
## [1] "MSE: 0.707226967019818"
## [1] "AIC: 4285.42080929419"
We can see that the AIC is enormous.
Using New cases/day
poisson1B <- glm(`New cases/day` ~ `Elapsed time`, data=cases, family=poisson)
#print(paste("Estimated overdispersion", sum(res.st^2)/23))
par(mfrow=c(2,2))
plot(poisson1B)## [1] "MSE: 0.276649607627415"
## [1] "AIC: 317.731647373115"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1B$null.deviance, deviance(poisson1B)), 2))## [1] "Null deviance: 870.39" "Residual deviance: 191.78"
Poisson with Elapsed time plus Elapsed time^2 as predictor
poisson1 <- glm(`Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2), data=cases[1:22, ], family=poisson)
pred.pois <- poisson1$fitted.values
res.st <- (cases$`Cumulative cases`[1:22] - pred.pois)/sqrt(pred.pois)
#n=25
#k=2
#n-k=23
print(paste("Estimated overdispersion", sum(res.st^2)/23))## [1] "Estimated overdispersion 0.770276221294292"
poisson1.pred <- predict(poisson1, newdata = cases[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson1.pred - cases$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 193.929575502612"
#paste("MSE:", mean(poisson1$residuals^2))
#print(sprintf("MSE: %0.2f", sum(poisson1$residuals^2)/poisson1$df.residual))
#print(sprintf("MSE: %0.2f", anova(poisson1)['Residuals', 'Mean Sq']))
paste("AIC:", poisson1$aic)## [1] "AIC: 159.884142409817"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1$null.deviance, deviance(poisson1)), 2))## [1] "Null deviance: 4909.62" "Residual deviance: 17.53"
Poisson with Elapsed time plus Sexo
poisson2 <- glm(`Cumulative cases` ~ `Elapsed time` + Sexo_M, data=data1[1:22, ], family=poisson)
poisson2.pred <- predict(poisson2, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson2.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 319.201779076467"
## [1] "AIC: 336.17935923303"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson2$null.deviance, deviance(poisson2)), 2))## [1] "Null deviance: 4909.62" "Residual deviance: 193.82"
Poisson with Elapsed time plus Group de edad
poisson3 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1[1:22, ], family=poisson)
pred.pois3 <- poisson3$fitted.values
res.st3 <- (data1$`Cumulative cases` - pred.pois3)/sqrt(pred.pois3)
#n=25
#k=7
#n-k=18
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st3^2)/18))## [1] "Estimated overdispersion 10681.8735263657"
poisson3.pred <- predict(poisson3, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson3.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 726.134223682346"
## [1] "AIC: 277.80626391695"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson3$null.deviance, deviance(poisson3)), 2))## [1] "Null deviance: 4909.62" "Residual deviance: 127.45"
Poisson with Elapsed time, Age and Departments as predictors
poisson4 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`, data=data1[1:22, ], family=poisson)
par(mfrow=c(2,2))
plot(poisson4)pred.pois4 <- poisson4$fitted.values
res.st4 <- (data1$`Cumulative cases` - pred.pois4)/sqrt(pred.pois4)
#n=25
#k=19
#n-k=6
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4^2)/6))## [1] "Estimated overdispersion 112122.28078111"
poisson4.pred <- predict(poisson4, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson4.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 50497.8051676042"
## [1] "AIC: 176.02121689047"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson4$null.deviance, deviance(poisson4)), 2))## [1] "Null deviance: 4909.62" "Residual deviance: 1.66"
Poisson with Elapsed time, Age and Departments as predictors for new dataset
poisson4_2 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima`, data=data2[1:122, ], family=poisson)
# MISSING `Departamento o Distrito_Valle del Cauca` !!!!!!!!!!!!!!!!!!!!!!!!
par(mfrow=c(2,2))
plot(poisson4_2)pred.pois4_2 <- poisson4_2$fitted.values
res.st4_2 <- (data2$`Cumulative cases` - pred.pois4_2)/sqrt(pred.pois4_2)
#n=25
#k=19
#n-k=6
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4_2^2)/6))## [1] "Estimated overdispersion 8470378.93364402"
poisson4.pred_2 <- predict(poisson4_2, newdata = data2[123:126, ], type="response")
paste("RMSE:", sqrt(mean((poisson4.pred_2 - data2$`Cumulative cases`[123:127])^2)))## [1] "RMSE: NA"
## [1] "AIC: 21900.1694163944"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson4_2$null.deviance, deviance(poisson4_2)), 2))## [1] "Null deviance: 2295770.44" "Residual deviance: 20621.13"
Poisson with Elapsed time, Age and Departments as predictors for New cases/day
poisson4bis <- glm(`New cases/day` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`, data=data1[1:22, ], family=poisson)
par(mfrow=c(2,2))
plot(poisson4bis)pred.pois4bis <- poisson4bis$fitted.values
res.st4bis <- (data1$`Cumulative cases` - pred.pois4bis)/sqrt(pred.pois4bis)
#n=25
#k=19
#n-k=6
print(paste("Estimated overdispersion", est.overdispersion <- sum(res.st4bis^2)/6))## [1] "Estimated overdispersion 146440.312038581"
poisson4bis.pred <- predict(poisson4bis, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson4bis.pred - data1$`New cases/day`[23:25])^2)))## [1] "RMSE: 7193.89045584116"
## [1] "AIC: 145.362756582981"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson4bis$null.deviance, deviance(poisson4bis)), 2))## [1] "Null deviance: 583.23" "Residual deviance: 4.8"
Poisson with Elapsed time, Age, Departments and Continent of origin as predictors
poisson5 <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`+`Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`, data=data1, family=poisson)
#par(mfrow=c(2,2))
#plot(poisson5)
pred.pois5 <- poisson5$fitted.values
res.st5 <- (data1$`Cumulative cases` - pred.pois5)/sqrt(pred.pois5)
plot(pred.pois5, res.st5)
abline(h=0, lty=3, col="gray75")poisson5.pred <- predict(poisson5, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson5.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 1.04038468721165e-11"
## [1] "AIC: 212.203576123254"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson5$null.deviance, deviance(poisson5)), 2))## [1] "Null deviance: 7859.81" "Residual deviance: 0"
poisson6 <- glm(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia` + `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` + `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`, data = data1, family=poisson)
par(mfrow=c(2,2))
plot(poisson6)poisson6.pred <- predict(poisson6, newdata = data1[23:25, ], type="response")
paste("RMSE:", sqrt(mean((poisson6.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 90.4726620577828"
## [1] "AIC: 393.672801484039"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson6$null.deviance, deviance(poisson6)), 2))## [1] "Null deviance: 7859.81" "Residual deviance: 215.47"
We confirmed that the continent of origin variable is not significant at all when being a predictor just together with the time variable.
ANOVA to compare the Poisson models
#anova(poisson1, poisson3, poisson4, poisson5, test="Chisq")
anova(poisson1, poisson3, poisson4, test="Chisq")## Analysis of Deviance Table
##
## Model 1: `Cumulative cases` ~ `Elapsed time` + I(`Elapsed time`^2)
## Model 2: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+`
## Model 3: `Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` +
## `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` +
## `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 19 17.527
## 2 15 127.450 4 -109.92
## 3 3 1.665 12 125.78 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Predictive accuracy of the Poisson model for Cumulative cases
Predicting with a \(95\%\) confidence interval
New dataset
Predictive accuracy of the Poisson model for New cases/day
Predicting with a \(95\%\) confidence interval
Quasi Poisson with Elapsed time as predictor
poisson1quasi <- glm(`Cumulative cases` ~ `Elapsed time`, data=cases, family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson1quasi)pred.poisq <- poisson1quasi$fitted.values
res.stq <- (data1$`Cumulative cases` - pred.poisq)/sqrt(summary(poisson1quasi)$dispersion*pred.poisq)
print(paste("Estimated overdispersion", sum(res.stq^2)/23))## [1] "Estimated overdispersion 0.999988649853625"
## [1] "MSE: 0.162079239159427"
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson1quasi$null.deviance, deviance(poisson1quasi)), 2))## [1] "Null deviance: 7859.81" "Residual deviance: 280.62"
Quasi Poisson with Elapsed time and Age as predictor
#Let's apply a quasi poisson and see what happens
poisson2quasi <- glm(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1, family=quasipoisson)
par(mfrow=c(2,2))
plot(poisson1quasi)pred.poisq2 <- poisson2quasi$fitted.values
res.stq2 <- (data1$`Cumulative cases` - pred.poisq2)/sqrt(summary(poisson2quasi)$dispersion*pred.poisq2)
print(paste("Estimated overdispersion", sum(res.stq2^2)/18))## [1] "Estimated overdispersion 0.999984520837828"
## [1] "MSE: 0.148504779911293"
## [1] "AIC: NA"
paste(c("Null deviance: ", "Residual deviance:"),
round(c(poisson2quasi$null.deviance, deviance(poisson2quasi)), 2))## [1] "Null deviance: 7859.81" "Residual deviance: 200.75"
Negative Binomial with Elapsed time as predictor
#n=25, k=2, n-k=23
stdres <- rstandard(nb1)
print(paste("Estimated overdispersion", sum(stdres^2)/23))## [1] "Estimated overdispersion 1.14231711769198"
nb1.pred <- predict(nb1, newdata = data1[23:25, ], type = "response")
paste("RMSE:", sqrt(mean((nb1.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 996.537615924091"
## [1] "AIC: 210.519644324275"
## [1] "Null deviance: 525.01" "Residual deviance: 23.38"
Negative Binomial with Elapsed time plus Age as predictors
nb2 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`, data=data1[1:22, ])
par(mfrow=c(2,2))
plot(nb2)nb2.pred <- predict(nb2, newdata = data1[23:25, ], type = "response")
paste("RMSE:", sqrt(mean((nb2.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 1754.16392440364"
## [1] "AIC: 212.848620450058"
## [1] "Null deviance: 755.4" "Residual deviance: 22.99"
Negative Binomial with Elapsed time plus Department as predictors
nb3 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`, data=data1[1:22, ])
par(mfrow=c(2,2))
plot(nb3)nb3.pred <- predict(nb3, newdata = data1[23:25, ], type = "response")
paste("RMSE:", sqrt(mean((nb3.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 802.015356603543"
## [1] "AIC: 177.900778777897"
## [1] "Null deviance: 4908.99" "Residual deviance: 11.54"
Negative Binomial with Elapsed time plus Continent of origin as predictors
nb4 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Continente de procedencia_Asia`+`Continente de procedencia_Centroamérica`+`Continente de procedencia_Colombia`+`Continente de procedencia_Europa`+`Continente de procedencia_Norteamérica`+`Continente de procedencia_Sudamerica`, data=data1[1:22, ])
par(mfrow=c(2,2))
plot(nb4)nb4.pred <- predict(nb4, newdata = data1[23:25, ], type = "response")
paste("RMSE:", sqrt(mean((nb4.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 2821.91531049303"
## [1] "AIC: 167.282393448055"
## [1] "Null deviance: 4908.81" "Residual deviance: 12.92"
Negative Binomial with Elapsed time, Age and Departments as pedictors
nb5 <- glm.nb(`Cumulative cases` ~ `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.`+`Departamento o Distrito_Boyacá`+`Departamento o Distrito_Caldas`+`Departamento o Distrito_Casanare`+`Departamento o Distrito_Cauca`+`Departamento o Distrito_Cundinamarca`+`Departamento o Distrito_Meta`+`Departamento o Distrito_Quindío`+`Departamento o Distrito_Risaralda`+`Departamento o Distrito_Santander`+`Departamento o Distrito_Tolima`+`Departamento o Distrito_Valle del Cauca`, data=data1[1:22, ])
par(mfrow=c(2,2))
plot(nb5)# Calculating overdispersion n=25 k=19 n-k=6
stdres <- rstandard(nb5)
print(paste("Estimated overdispersion", sum(stdres^2)/6))## [1] "Estimated overdispersion NaN"
nb5.pred <- predict(nb5, newdata = data1[23:25, ], type = "response")
paste("RMSE:", sqrt(mean((nb5.pred - data1$`Cumulative cases`[23:25])^2)))## [1] "RMSE: 50497.8210064501"
## [1] "AIC: 178.022820115089"
## [1] "Null deviance: 4909.27" "Residual deviance: 1.66"
Applying ANOVA to compare the negative binomial models
We decided to compare nb1, nb2, nb5, because they are nested and we are more interested in seeing if the fifth model is in fact better than the first model.
## Likelihood ratio tests of Negative Binomial Models
##
## Response: Cumulative cases
## Model
## 1 Elapsed time
## 2 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+`
## 3 `Elapsed time` + `Grupo de edad_19_30` + `Grupo de edad_31_45` + `Grupo de edad_46_60` + `Grupo de edad_60_75` + `Grupo de edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + \n `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 1.594523e+01 20 -204.5196
## 2 2.523182e+01 15 -196.8486 1 vs 2 5 7.671024 1.753224e-01
## 3 2.825220e+06 3 -138.0228 2 vs 3 12 58.825800 3.691871e-08
Predictive accuracy of the Negative Binomial model
Predicting with a \(95\%\) confidence interval
The Bayesian approach
Poisson regression
As a first attempt, we fit a simple Poisson regression:
\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Poisson}(\lambda_i) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]
with \(i = 1,\dots,134\), being \(134\) the number of rows of our dataset, and \(y_i\) represents the number of cases.
For what concerns the stan program, we used the function poisson_log_rng to describe the distribution of \(y_i\), namely the number of cases each day and the function poisson_log_lpmf to specify the likelihood.
Posterior predictive check
The fit is not satisfactory, it is probably due to overdispersion, we can check the residuals to confirm this hypothesis.
Residual check
#in this way we check the standardized residuals
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)The variance of the residuals increases as the predicted value increase. The standardized residuals should have mean 0 and standard deviation 1 (hence the lines at \(+2\) and \(-2\) indicates approximate \(95\%\) error bounds).
The plot of the standardized residuals indicates a large amount of overdispersion.
Classically the problem of having overdispersed data is solved using the negative binomial model instead of the Poisson’s one.
Negative Binomial model
We try to improve the previous model using the Negative Binomial model:
\[ ln(\lambda_i) = \alpha + \beta\cdot elapsed\_time_i \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta \sim \mathcal{N}(0.25,1) \]
Where the parameter \(\phi\) is called precision and it is such that:
\[ E[y_i] = \lambda_i \\ Var[y_i] = \lambda_i + \frac{\lambda_i^2}{\phi} \]
again \(i=1,\dots,134\). As \(\phi \rightarrow \infty\) the negative binomial approaches the Poisson distribution.
The stan function that we use here are neg_binomial_2_log_rng to specify the distribution of \(y_i\) and the function neg_binomial_2_log_lpmf for the likelihood.
Posterior predictive check
samples_NB <- rstan::extract(fit2)
y_rep <- samples_NB$y_rep
ppc_dens_overlay(y = model.data$cases, y_rep[1:200,]) Residual check
mean_inv_phi <- mean(samples_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)The situation is better now, but still we have too many residuals outside the \(95\%\) interval.
Accuracy across departments
ppc_stat_grouped(
y = model.data$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)We should take into account the differences across departments.
Multilevel Negative Binomial regression
We try to fit the following model, which also includes Age as covariat:
\[ ln(\lambda_i) = \alpha + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age \\ y_i \sim \mathcal{Negative Binomial}(\lambda_i, \phi) \\ \alpha \sim \mathcal{N}(0,1) \\ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \]
Posterior predictive check
samples_NB2 <- rstan::extract(fit3)
y_rep <- samples_NB2$y_rep
ppc_dens_overlay(y = model.data2$cases, y_rep[1:200,]) Residual check
mean_inv_phi <- mean(samples_NB2$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (model.data2$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)Accuracy across departments
ppc_stat_grouped(
y = model.data2$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)Hierarchical model
In order to improve the fit, we fit a model with department-specific intercept term.
So the varying intercept model that we take into account is now:
\[ ln(\lambda_{i,d}) = \alpha_d + + \beta_{time}\cdot elapsed\_time_i + \beta_{age}\cdot age_i\\ \alpha_d \sim \mathcal{N}(\mu + \beta_{pop}\cdot pop_d + \beta_{sur}\cdot surface_d + \beta_{dens} \cdot density_d, \sigma_{\alpha})\\ y_i \sim \mathcal{Negative Binomial}(\lambda_{i,d}, \phi) \]
The priors used for the above model are the following:
\[ \beta_{time} \sim \mathcal{N}(0.5,1) \\ \beta_{age} \sim \mathcal{N}(0,1) \\ \psi \sim \mathcal{N}(0,1) \]
being \(\psi = [\beta_{pop}, \beta_{sur}, \beta_{dens}]\).
New dataset
We added the following covariats into the dataset:
People: millions of inhabitants for each region;Surface: \(km^3\), extent of each region;Density: \(\frac{people}{km^2}\), density of the population in each region.
The model is:
Posterior predictive check
samples_hier <- rstan::extract(fit.4)
y_rep <- samples_hier$y_rep
ppc_dens_overlay(y = data.hier.NB.complete$cases, y_rep[1:200,]) Residual check
mean_inv_phi <- mean(samples_hier$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_residual <- (data.hier.NB.complete$cases - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_residual) + hline_at(2) + hline_at(-2)Very few points are now outside the \(95\%\) confidence interval.
Accuracy across departments
ppc_stat_grouped(
y = data.hier.NB.complete$cases,
yrep = y_rep,
group = cases_dep$Department,
stat = "mean",
binwidth = 0.2
)We can clearly see that the accuracy across the departments has significantly increased with respect to the previous models.
LOOIC
The Leave-One-Out cross validation is a method for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulation of the parameters values.
Plot the looic to compare models:
loo.all.deps<-c(loo.model.Poisson[3], loo.model.NB[3], loo.model.NB2[3], loo.model.NB.hier[3])
sort.loo.all.deps<- sort.int(loo.all.deps, index.return = TRUE)$x
par(xaxt="n")
plot(sort.loo.all.deps, type="b", xlab="", ylab="LOOIC", main="Model comparison")
par(xaxt="s")
axis(1, c(1:4), c("Poisson", "NB-sl", "NB-ml",
"hier")[sort.int(loo.all.deps,
index.return = TRUE)$ix],
las=2)